
    Generalized Filtering Decomposition

    This paper introduces a new preconditioning technique suitable for matrices arising from the discretization of a system of PDEs on unstructured grids. The preconditioner satisfies a so-called filtering property, which ensures that the input matrix and the preconditioner act identically on a given filtering vector. This vector is chosen to alleviate the effect of low-frequency modes on convergence, and thus to reduce or eliminate the plateau often observed in the convergence of iterative methods. In particular, the paper presents a general approach that ensures the filtering condition is satisfied in a matrix decomposition. The input matrix can have an arbitrary sparse structure and can therefore be reordered using nested dissection, allowing parallel computation of both the preconditioner and the iterative process.
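    One simple way to realize such a filtering condition (a minimal sketch, not the decomposition proposed in the paper) is diagonal compensation: start from any approximate preconditioner M0 and correct its diagonal so that M and A agree on the filtering vector t.

```python
import numpy as np

def diagonal_filtering_fix(A, M0, t):
    """Adjust the diagonal of an approximate preconditioner M0 so that the
    filtering condition M @ t == A @ t holds exactly (t must have no zeros)."""
    r = (A - M0) @ t            # residual of the filtering condition
    return M0 + np.diag(r / t)  # diagonal compensation

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n)) + n * np.eye(n)  # toy diagonally dominant matrix
M0 = np.diag(np.diag(A))                         # crude Jacobi-like approximation
t = np.ones(n)                                   # filtering vector targeting low-frequency modes
M = diagonal_filtering_fix(A, M0, t)
assert np.allclose(M @ t, A @ t)                 # filtering property holds
```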

    Accelerating Cosmic Microwave Background map-making procedure through preconditioning

    Estimation of the sky signal from sequences of time-ordered data is one of the key steps in Cosmic Microwave Background (CMB) data analysis, commonly referred to as the map-making problem. Some of the most popular and general methods proposed for this problem involve solving generalised least squares (GLS) equations with non-diagonal noise weights given by a block-diagonal matrix with Toeplitz blocks. In this work we study new map-making solvers potentially suitable for applications to the largest anticipated data sets. They are based on iterative conjugate gradient (CG) approaches enhanced with novel, parallel, two-level preconditioners. We apply the proposed solvers to examples of simulated non-polarised and polarised CMB observations, and a set of idealised scanning strategies with sky coverage ranging from nearly a full sky down to small sky patches. We discuss in detail their implementation for massively parallel computational platforms and their performance for a broad range of parameters characterising the simulated data sets. We find that our best new solver can outperform carefully optimised standard solvers used today by a factor of as much as 5 in terms of the convergence rate and a factor of up to 44 in terms of the time to solution, without significantly increasing the memory consumption or the volume of inter-processor communication. The performance of the new algorithms is also found to be more stable and robust, and less dependent on specific characteristics of the analysed data set. We therefore conclude that the proposed approaches are well suited to successfully address the challenges posed by new and forthcoming CMB data sets.
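    As a rough illustration of the problem being solved (a toy sketch with hypothetical sizes; a diagonal stand-in replaces the Toeplitz-block noise weights, and a plain diagonal preconditioner replaces the paper's two-level construction):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

# Toy map-making setup: time-ordered data d = P m + noise; solve the GLS
# normal equations (P^T N^-1 P) m = P^T N^-1 d with preconditioned CG.
rng = np.random.default_rng(1)
nt, npix = 2000, 50
P = np.zeros((nt, npix))
P[np.arange(nt), rng.integers(0, npix, nt)] = 1.0   # one-hot pointing matrix
w = 1.0 / (1.0 + 0.1 * rng.random(nt))              # stand-in for Toeplitz-block N^-1
m_true = rng.standard_normal(npix)
d = P @ m_true + 0.01 * rng.standard_normal(nt)

A = P.T @ (w[:, None] * P)                          # GLS system matrix
b = P.T @ (w * d)
diagA = np.diag(A)                                  # first-level diagonal preconditioner
prec = LinearOperator((npix, npix), matvec=lambda x: x / diagA)
m_hat, info = cg(A, b, M=prec)                      # preconditioned conjugate gradient
assert info == 0                                    # converged
```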

    LU factorization with panel rank revealing pivoting and its communication avoiding version

    We present the LU decomposition with panel rank revealing pivoting (LU_PRRP), an LU factorization algorithm based on strong rank revealing QR panel factorization. LU_PRRP is more stable than Gaussian elimination with partial pivoting (GEPP). Our extensive numerical experiments show that the new factorization scheme is as numerically stable as GEPP in practice, but it is more resistant to pathological cases and easily solves the Wilkinson matrix and the Foster matrix. We also present CALU_PRRP, a communication avoiding version of LU_PRRP that minimizes communication. CALU_PRRP is based on tournament pivoting, with the selection of the pivots at each step of the tournament being performed via strong rank revealing QR factorization. CALU_PRRP is more stable than CALU, the communication avoiding version of GEPP: it is more stable in practice and is resistant to pathological cases on which both GEPP and CALU fail.
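    The Wilkinson pathology mentioned above is easy to reproduce: under GEPP the last column doubles at every elimination step, giving element growth of 2^(n-1), which is exactly the failure mode LU_PRRP is designed to avoid. A quick numerical check:

```python
import numpy as np
from scipy.linalg import lu

def wilkinson(n):
    """Wilkinson's pathological matrix: ones on the diagonal and in the last
    column, -1 strictly below the diagonal."""
    A = np.eye(n) - np.tril(np.ones((n, n)), -1)
    A[:, -1] = 1.0
    return A

n = 30
_, _, U = lu(wilkinson(n))   # scipy's lu performs GEPP
print(f"largest |U| entry: {abs(U).max():.3e}  (2**{n-1} = {2.0**(n-1):.3e})")
```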

    Hybrid static/dynamic scheduling for already optimized dense matrix factorization

    We present a hybrid static/dynamic scheduling strategy for the task dependency graph of direct methods in dense numerical linear algebra. This strategy balances data locality, load balance, and dequeue overhead. We show that using this scheduling in communication avoiding dense factorization leads to significant performance gains. On a 48-core AMD Opteron NUMA machine, our experiments show that we can achieve up to 64% improvement over a version of CALU that uses fully dynamic scheduling, and up to 30% improvement over the version that uses fully static scheduling. On a 16-core Intel Xeon machine, our hybrid static/dynamic scheduling approach is up to 8% faster than the versions of CALU that use fully static or fully dynamic scheduling. Our algorithm also yields speedups over the corresponding LU factorization routines in well-known libraries: on the 48-core AMD NUMA machine, our best implementation is up to 110% faster than MKL, while on the 16-core Intel Xeon machine it is up to 82% faster than MKL. Our approach also shows significant speedups compared with PLASMA on both of these systems.
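    A toy sketch of the idea (illustrative only, not the paper's CALU implementation; task names are made up and inter-task dependencies are ignored): critical-path panel tasks are statically bound to one thread, while trailing-matrix updates are picked up dynamically from a shared queue.

```python
import queue
import threading

nb = 4                                              # number of panels
static_tasks = [f"panel_{k}" for k in range(nb)]    # critical path, scheduled statically
dynamic_tasks = queue.Queue()
for k in range(nb):
    for j in range(k + 1, nb):
        dynamic_tasks.put(f"update_{k}_{j}")        # trailing updates, scheduled dynamically

def static_worker():
    for t in static_tasks:                          # fixed order, no dequeue overhead
        print("thread-0 (static): ", t)

def dynamic_worker(tid):
    while True:                                     # drain the shared queue
        try:
            t = dynamic_tasks.get_nowait()
        except queue.Empty:
            return
        print(f"thread-{tid} (dynamic):", t)

threads = [threading.Thread(target=static_worker)]
threads += [threading.Thread(target=dynamic_worker, args=(i,)) for i in (1, 2)]
for t in threads: t.start()
for t in threads: t.join()
```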

    Kronecker Product Approximation Preconditioners for Convection-diffusion Model Problems

    We consider the iterative solution of the linear systems arising from four convection-diffusion model problems: the scalar convection-diffusion problem, the Stokes problem, the Oseen problem, and the Navier-Stokes problem. We give the explicit Kronecker product structure of the coefficient matrices, in particular the Kronecker product structure of the convection term. For the latter three model problems, the coefficient matrices have a 2 × 2 block structure, and each block is a Kronecker product or a sum of several Kronecker products. We use the Kronecker product and block structures to design a diagonal block preconditioner, a tridiagonal block preconditioner, and a constraint preconditioner. The constraint preconditioner can be regarded as a modification of the tridiagonal and diagonal block preconditioners based on the cell Reynolds number, which explains why the constraint preconditioner is usually better. We also give numerical examples showing the efficiency of this kind of Kronecker product approximation preconditioner.
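    For the scalar case, the Kronecker structure referred to above can be written down directly. A minimal sketch for a 2D convection-diffusion operator on a uniform grid (the coefficients nu, wx, wy are illustrative):

```python
import numpy as np

def lap1d(n, h):
    """1D three-point Laplacian (Dirichlet boundaries)."""
    return (2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

def conv1d(n, h):
    """1D centred first-difference (convection) matrix."""
    return (np.eye(n, k=1) - np.eye(n, k=-1)) / (2*h)

# 2D convection-diffusion matrix as a sum of Kronecker products:
#   A = nu*(I kron T + T kron I) + wx*(I kron C) + wy*(C kron I)
n, nu, wx, wy = 8, 1.0, 1.0, 0.5
h = 1.0 / (n + 1)
I, T, C = np.eye(n), lap1d(n, h), conv1d(n, h)
A = nu*(np.kron(I, T) + np.kron(T, I)) + wx*np.kron(I, C) + wy*np.kron(C, I)
print(A.shape)   # (n*n, n*n) operator assembled purely from Kronecker factors
```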

    Randomized block Gram-Schmidt process for solution of linear systems and eigenvalue problems

    We propose a block version of the randomized Gram-Schmidt process for computing a QR factorization of a matrix. Our algorithm inherits the major properties of its single-vector analogue from [Balabanov and Grigori, 2020], such as higher efficiency than the classical Gram-Schmidt algorithm and the stability of the modified Gram-Schmidt algorithm, which can be refined even further by using multi-precision arithmetic. As in [Balabanov and Grigori, 2020], our algorithm has the advantage of performing the standard high-dimensional operations, which define the overall computational cost, with a unit roundoff independent of the dominant dimension of the matrix. This unique feature makes the methodology especially useful for large-scale problems computed on low-precision arithmetic architectures. Block algorithms are advantageous in terms of performance, as they rely mainly on cache-friendly matrix-wise operations and can reduce communication cost in high-performance computing. Block Gram-Schmidt orthogonalization is the key element in the block Arnoldi procedure for the construction of a Krylov basis, which in turn is used in GMRES and Rayleigh-Ritz methods for the solution of linear systems and clustered eigenvalue problems. In this article, we develop randomized versions of these methods based on the proposed randomized Gram-Schmidt algorithm, and validate them on nontrivial numerical examples.
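    The core single-vector idea that the block version builds on can be sketched in a few lines (an illustrative reading with a plain Gaussian sketching matrix assumed, not the paper's algorithm): all inner products are taken in a low-dimensional sketched space, so the expensive high-dimensional work is confined to simple updates.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 2000, 50, 200                  # sketch size k of order a few times n
Theta = rng.standard_normal((k, m)) / np.sqrt(k)   # Gaussian sketch (assumption)
W = rng.standard_normal((m, n))

Q = np.zeros((m, n))
S = np.zeros((k, n))                     # S = Theta @ Q, maintained incrementally
for j in range(n):
    w = W[:, j]
    # projection coefficients are computed in the small sketched space only
    r = np.linalg.lstsq(S[:, :j], Theta @ w, rcond=None)[0]
    q = w - Q[:, :j] @ r                 # the only high-dimensional update
    Q[:, j] = q / np.linalg.norm(Theta @ q)   # normalise in the sketched norm
    S[:, j] = Theta @ Q[:, j]
# Q is orthonormal in the sketched metric: (Theta@Q).T @ (Theta@Q) ~ I
print(np.linalg.norm((Theta @ Q).T @ (Theta @ Q) - np.eye(n)))
```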

    Spherical harmonic transform with GPUs

    We describe an algorithm for computing an inverse spherical harmonic transform suitable for graphics processing units (GPUs). We use CUDA and base our implementation on a Fortran90 routine included in a publicly available parallel package, S2HAT. We focus our attention on the two major sequential steps involved in computing the transforms, retaining the efficient parallel framework of the original code. We detail the optimization techniques used to enhance the performance of the CUDA-based code and contrast them with those implemented in the Fortran90 version. We also present performance comparisons of a single CPU plus GPU unit with the S2HAT code running on either one or four processors. In particular, we find that use of the latest generation of GPUs, such as the NVIDIA GF100 (Fermi), can accelerate the spherical harmonic transforms by as much as 18 times with respect to S2HAT executed on one core, and by as much as 5.5 times with respect to S2HAT on four cores, with the overall performance being limited by the Fast Fourier transforms. The work presented here has been performed in the context of Cosmic Microwave Background simulations and analysis. However, we expect that the developed software will be of more general interest and applicability.
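    The two sequential steps referred to above have the following generic structure (a toy sketch, not the S2HAT or CUDA code; normalisation of the Legendre functions is omitted and only non-negative m is kept):

```python
import numpy as np
from scipy.special import lpmv

rng = np.random.default_rng(3)
lmax, ntheta, nphi = 8, 12, 32
x = np.cos(np.linspace(0.1, np.pi - 0.1, ntheta))      # iso-latitude rings

# step 1: per-m sums over ell, Delta[m, i] = sum_{l>=m} a_lm * P_l^m(x_i)
Delta = np.zeros((lmax + 1, ntheta), dtype=complex)
for m in range(lmax + 1):
    for l in range(m, lmax + 1):
        a_lm = rng.standard_normal() + 1j * rng.standard_normal()
        Delta[m] += a_lm * lpmv(m, l, x)

# step 2: f(theta_i, phi_j) = sum_m Delta[m, i] * exp(i*m*phi_j) via an FFT
F = np.zeros((nphi, ntheta), dtype=complex)
F[: lmax + 1] = Delta                                  # m >= 0 only in this toy
fmap = np.fft.ifft(F, axis=0) * nphi                   # rings of the sky map
print(fmap.shape)
```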

    A 3D Parallel Algorithm for QR Decomposition

    Interprocessor communication often dominates the runtime of large matrix computations. We present a parallel algorithm for computing QR decompositions whose bandwidth cost (communication volume) can be decreased at the cost of increasing its latency cost (number of messages). By varying a parameter that navigates the bandwidth/latency tradeoff, we can tune this algorithm for machines with different communication costs.
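    The flat binary-tree case of such a communication-avoiding QR can be sketched as follows (a minimal TSQR-style sketch; in the actual algorithm the tree shape is the tuning knob that trades bandwidth against latency):

```python
import numpy as np

rng = np.random.default_rng(4)
blocks = [rng.standard_normal((100, 10)) for _ in range(8)]  # one block per "processor"

Rs = [np.linalg.qr(B)[1] for B in blocks]       # local QR factorizations
while len(Rs) > 1:                              # log2(P) pairwise merge rounds
    Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]))[1] for i in range(0, len(Rs), 2)]
R = Rs[0]

# R agrees (up to row signs) with the R factor of the stacked matrix
R_ref = np.linalg.qr(np.vstack(blocks))[1]
assert np.allclose(np.abs(R), np.abs(R_ref))
```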

    Communication Avoiding Gaussian Elimination

    This paper presents CALU, a Communication Avoiding algorithm for the LU factorization of dense matrices distributed in a two-dimensional (2D) cyclic layout. The algorithm is based on a new pivoting strategy, referred to as ca-pivoting, that is shown to be stable in practice. The ca-pivoting strategy leads to a significant decrease in the number of messages exchanged during the factorization of a block-column relative to conventional algorithms, and thus CALU overcomes the latency bottleneck of the LU factorization present in current implementations such as ScaLAPACK and HPL. The experimental part of this paper focuses on the evaluation of the performance of CALU on two computational systems: an IBM POWER 5 system with 888 compute processors distributed among 111 compute nodes, and a Cray XT4 system with 9660 dual-core AMD Opteron processors. We compare CALU with the ScaLAPACK routine PDGETRF, which computes the LU factorization. Our experiments show that CALU reduces the parallel time of the LU factorization. The gain depends on the size of the matrices and on the characteristics of the computer architecture; in particular, the effect is significant when latency accounts for an important share of the overall time, for example when a small matrix is factored on a large number of processors. The factorization of a block-column, referred to as TSLU, reaches a performance of 215 GFLOPs/s on 64 processors of the IBM POWER 5 system and 240 GFLOPs/s on 64 processors of the Cray XT4 system, representing 44% and 36% of the theoretical peak performance on these systems, respectively. TSLU outperforms the corresponding ScaLAPACK routine PDGETF2 by up to a factor of 4.37 on the IBM POWER 5 system and up to a factor of 5.58 on the Cray XT4 system. On square matrices of order 10000, CALU outperforms PDGETRF by a factor of 1.24 on the IBM POWER 5 and by a factor of 1.31 on the Cray XT4, representing 40% and 23% of the peak performance on these systems. The best improvement obtained by CALU is a speedup of 2.29 on the IBM POWER 5 and a speedup of 1.81 on the Cray XT4.
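    The tournament idea behind ca-pivoting can be sketched as follows (illustrative only; GEPP plays each "game" of the tournament, and the panel sizes and processor count are made up): each processor proposes b candidate pivot rows from its local part of the panel, and candidates are merged pairwise up a reduction tree, costing one message round per tree level instead of one per column.

```python
import numpy as np
from scipy.linalg import lu_factor

def gepp_pivot_rows(W, b):
    """Indices of the first b pivot rows that GEPP selects on the block W."""
    piv = lu_factor(W)[1]                    # LAPACK-style sequential row swaps
    perm = np.arange(W.shape[0])
    for i, p in enumerate(piv):
        perm[[i, p]] = perm[[p, i]]
    return perm[:b]

rng = np.random.default_rng(5)
b = 4
panel = rng.standard_normal((64, b))                      # tall-skinny panel
owners = np.array_split(np.arange(64), 4)                 # rows per "processor"
cands = [idx[gepp_pivot_rows(panel[idx], b)] for idx in owners]   # leaf games
while len(cands) > 1:                                     # binary reduction tree
    cands = [np.concatenate(cands[i:i + 2]) for i in range(0, len(cands), 2)]
    cands = [idx[gepp_pivot_rows(panel[idx], b)] for idx in cands]
pivot_rows = cands[0]        # b pivot rows used to factor the whole panel
print(pivot_rows)
```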

    Two-sided tangential filtering decomposition

    In this paper we study a class of preconditioners that satisfy the so-called left and/or right filtering conditions. For practical applications, we use a multiplicative combination of filtering-based preconditioners with the classical ILU(0) preconditioner, which is known to be efficient. Although the left filtering condition has a sounder theoretical motivation than the right one, extensive tests on convection-diffusion equations with heterogeneous and anisotropic diffusion tensors reveal that satisfying the left or the right filtering condition leads to comparable results. Concerning the filtering vector, these numerical tests reveal that e = [1, …, 1]^T is a reasonable choice, which is effective and avoids the preprocessing needed in other methods to build the filtering vector. Numerical tests show that the composite preconditioners are rather robust and efficient for these problems with strongly varying coefficients.
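    A minimal sketch of such a multiplicative combination (assumptions: scipy's spilu stands in for ILU(0), and a crude diagonal right-filtering factor stands in for the tangential filtering decomposition):

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import LinearOperator, gmres, spilu

# Convection-diffusion-like toy matrix
rng = np.random.default_rng(6)
n = 100
A = 4*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1) + 0.01*rng.standard_normal((n, n))
e = np.ones(n)                         # filtering vector e = [1, ..., 1]^T

MF = np.diag((A @ e) / e)              # crude right-filtering factor: MF @ e == A @ e
ilu = spilu(csc_matrix(A))             # scipy's incomplete LU, stand-in for ILU(0)

def combined(r):
    z = np.linalg.solve(MF, r)         # apply the filtering factor first ...
    return z + ilu.solve(r - A @ z)    # ... then ILU on the remaining residual

M = LinearOperator((n, n), matvec=combined)
b = rng.standard_normal(n)
x, info = gmres(A, b, M=M)             # preconditioned GMRES
assert info == 0                       # converged
```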